Instructions¶
Lab sessions aim to:
- Show and reinforce how models and ideas presented in class are used in practice.
- Help you develop hands-on machine learning skills.
Lab sessions are:
- Learning environments where you work with Jupyter notebooks and where you can get support from TAs and fellow students.
- Not graded and do not have to be submitted.
Use of AI tools¶
AI tools, such as ChatGPT and Copilot, are great aids for programming. Moreover, in your later careers you will work in a world where such tools are widely available. We therefore encourage you to use AI tools effectively (both in the lab sessions and in the assignments). However, be careful not to overestimate their capabilities! AI tools cannot replace you: you still have to conceptualise the problem, dissect and structure it, and conduct proper analysis and modelling. We recommend being especially cautious about using AI tools for the more conceptual and reflective questions.
Google Colab workspace set-up¶
Uncomment the code lines in the following cell if you are running this notebook on Colab
#!git clone https://github.com/TPM034A/Q2_2025
#!pip install -r Q2_2025/requirements_colab.txt
#!mv "/content/Q2_2025/Lab_sessions/lab_session_03/data" /content/data
Application: Predicting perceived visual neighbourhood attractiveness
In this lab session we will use various ML models, namely Linear regressions (LR), Random forests (RF), Multi-layer perceptrons (MLP), and Ensembles (E), to predict the perceived visual attractiveness of neighbourhoods. Understanding the visual attractiveness of neighbourhoods is important for various reasons. For instance, municipalities need to know which neighbourhoods need attention because they are visually unattractive. Moreover, the visual attractiveness of neighbourhoods is important for understanding and predicting residential location choices, house prices, and tourist destination choices.
In this lab session we aim to develop a computationally efficient ML model that is capable of mapping urban images (i.e. Google Street View images) to visual attractiveness levels.
Where do the true labels come from?
Since we use supervised learning, we need data containing the true labels. But where do the true labels for visual attractiveness come from? The true labels come from a computer vision model that is trained on data from a so-called discrete choice experiment. In this experiment, people were placed in a hypothetical situation in which they had to move to a different neighbourhood and were shown two options. The visual attractiveness is learned from their choices.
See the figure below for an example of one such choice task. More information about the survey and the model can be found in this paper.
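The actual model in the paper is a computer vision model combined with a discrete choice model, but the core idea of learning scores from pairwise choices can be sketched in isolation. The snippet below is only an illustrative toy example on synthetic data: each image gets a latent attractiveness score, respondents choose between image pairs with binary logit probabilities, and the scores are recovered by maximum likelihood. All numbers and names here are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_images, n_tasks = 20, 5000
true_v = rng.normal(size=n_images)            # latent attractiveness per image

# Simulate pairwise choice tasks: respondents pick between image i and image j
i = rng.integers(0, n_images, n_tasks)
j = rng.integers(0, n_images, n_tasks)
mask = i != j
i, j = i[mask], j[mask]
p_i = 1 / (1 + np.exp(-(true_v[i] - true_v[j])))   # binary logit choice probability
y = (rng.random(i.size) < p_i).astype(float)       # 1 if image i was chosen

# Recover the scores by gradient ascent on the logit log-likelihood
v = np.zeros(n_images)
for _ in range(1000):
    p = 1 / (1 + np.exp(-(v[i] - v[j])))
    grad = np.zeros(n_images)
    np.add.at(grad, i, y - p)                      # image i chosen more than predicted -> raise v[i]
    np.add.at(grad, j, p - y)                      # and lower v[j] symmetrically
    v += grad / i.size
v -= v.mean()  # scores are identified only up to an additive constant

corr = np.corrcoef(v, true_v)[0, 1]
print(f"correlation between recovered and true scores: {corr:.2f}")
```

With enough simulated choice tasks the recovered scores correlate strongly with the true latent scores, which is the mechanism that makes choice-based labels usable as a supervised learning target.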

Why do we need another model to (re)do the mapping?
The mapping from image to visual attractiveness indeed already comes from a model, so one may wonder why we need another one. Applying the computer vision model is computationally expensive and requires a GPU to process images in large quantities. A good, computationally efficient proxy model is therefore useful. For instance, building a map like the one below for Delft requires processing over 400k images.

Data for this lab
In this lab session you will work with four datasets:
- data/image_tabular/image_metadata.csv: A csv file with image metadata (e.g., year, month, location) of Rotterdam images.
- data/image_tabular/image_embeddings.csv: A csv file with image embeddings of images from Rotterdam. (We will explain what image embeddings are later.)
- data/geo/hexagons.gpkg: A geospatial dataset of Rotterdam.
- data/images: A folder with image files from Rotterdam (read below for more details).
The first three data files are already in the data folder associated with this lab. The images folder still needs to be downloaded. The full image dataset is fairly large (14 GB) and contains 101,444 images from Rotterdam. You can download the full dataset if you want to explore all the images on your own, but this is not required for completing this lab session. For this lab session we have created a subset of 1,000 images, which is only 140 MB. This allows you to carry out the visualisations in this lab.
The following cell will download the subset of images and place them in the data folder automatically. It can take up to one minute. If you want to download the full dataset, simply set the variable download_full_dataset to True.
## IMPORTANT: You have to be on the TUDelft network (eduroam) or under eduVPN to run this script
download_full_dataset = False
from assets import image_downloader as imd
imd.download_images(download_full_dataset)
Downloading images... Download complete! Unzipping images... Unzip complete! Removing zip file... Done!
Learning objectives. After completing the following exercises you will be able to:
- Train multiple ML models, including Linear regressions, Random Forests and Multi-Layer Perceptrons
- Identify the most important features and their impact on the target feature
- Work with embeddings of images
- Explore the pros and cons of each model, and how models can complement each other to answer specific research questions
Organisation¶
This lab session comprises 3 parts:
1. Loading and exploring the data
1.1. Reading the metadata and geospatial file
1.2. Exploring the image metadata
1.3. Exploring the geospatial data
1.4. Visual inspection of images
2. Image embeddings
2.1. Embedding model
2.2. Exploring the embeddings
3. Training multiple models: Regression model, Random forest and MLP for predicting attractiveness
3.1. Preparing the dataset
3.2. Linear multiple regression model
3.3. Random Forest
3.4. Multilayer perceptrons
3.5. Comparing and reflecting on the model performances and their outcomes
# Basic libraries
import numpy as np
import pandas as pd
import geopandas as gpd
# ML tools
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV, cross_validate
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error, make_scorer, log_loss
# Models
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.neural_network import MLPRegressor
# Visualization libraries
import seaborn as sns
from matplotlib import pyplot as plt
from mpl_toolkits.axes_grid1 import ImageGrid
from matplotlib.ticker import FixedLocator
from sklearn.tree import plot_tree
from branca.element import Figure
# Other libraries
from pathlib import Path
from shapely.geometry import Point
from PIL import Image
import pickle
1. Loading and exploring the data¶
1.1. Reading the metadata and geospatial data¶
Before creating models, we must understand our datasets. So, first open the image metadata dataset (data/image_tabular/image_metadata.csv) which contains general info (metadata) about the images, and the geospatial dataset (data/geo/hexagons.gpkg) which contains the spatial zones we will work with.
# Data folder path
data = Path('data')
# Reading image df
img_metadata = pd.read_csv(data/'image_tabular'/'image_metadata.csv')
# Reading hexagons gdf
hexagons = gpd.read_file(data/'geo'/'hexagons.gpkg')
# Verify geographic coordinate system / projection
print(hexagons.crs)
EPSG:4326
1.2. Exploring the image metadata¶
The img_metadata DataFrame contains the metadata of thousands of Street View Images (SVI) from Rotterdam. With head() we can explore the available columns.
img_metadata.head(5)
| img_id | img_path | year | month | lat | lng | hex_id | attractiveness | in_folder | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | image_261669_s_a.png | 2019 | 5 | 51.992875 | 4.595223 | 8b19694da002fff | 0.745663 | 0 |
| 1 | 1 | image_261609_s_a.png | 2021 | 1 | 51.992723 | 4.595357 | 8b19694da003fff | 0.729360 | 0 |
| 2 | 2 | image_261683_s_a.png | 2017 | 7 | 51.993201 | 4.594991 | 8b19694da006fff | 0.767513 | 0 |
| 3 | 3 | image_261590_s_a.png | 2016 | 7 | 51.991961 | 4.595907 | 8b19694da008fff | 0.721686 | 0 |
| 4 | 4 | image_261599_s_a.png | 2017 | 7 | 51.992304 | 4.595650 | 8b19694da00efff | 0.572475 | 0 |
The dataset contains:
- img_id: A unique identifier of an individual image
- img_path: Image filename in the image folder
- year: Year when the picture was taken
- month: Month when the picture was taken
- lat: geospatial latitude of the image
- lng: geospatial longitude of the image
- hex_id: geospatial hexagon id where the image was taken (see more details below)
- attractiveness: Numerical value representing the perceived attractiveness of the image
- in_folder: Binary column indicating if the image is in the sub-set of images or not
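Before modelling, it can pay off to turn the schema assumptions above into explicit checks. The snippet below is a minimal sketch on a toy frame with made-up values mirroring the metadata columns; on the real data you would run the same assertions against img_metadata.

```python
import pandas as pd

# Toy frame mirroring the metadata schema (values are made up for illustration)
meta = pd.DataFrame({
    'img_id': [0, 1, 2],
    'year': [2019, 2021, 2017],
    'month': [5, 1, 7],
    'lat': [51.99, 51.99, 51.99],
    'lng': [4.59, 4.59, 4.59],
    'in_folder': [0, 1, 0],
})

# Basic sanity checks: unique ids, valid months, binary flag
assert meta['img_id'].is_unique
assert meta['month'].between(1, 12).all()
assert meta['in_folder'].isin([0, 1]).all()
print('schema checks passed')
```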
We look at general statistics of img_metadata using describe():
img_metadata.describe()
| img_id | year | month | lat | lng | attractiveness | in_folder | |
|---|---|---|---|---|---|---|---|
| count | 101444.000000 | 101444.000000 | 101444.000000 | 101444.000000 | 101444.000000 | 101444.000000 | 101444.000000 |
| mean | 50721.500000 | 2015.483341 | 6.995032 | 51.918793 | 4.490640 | -0.285988 | 0.009858 |
| std | 29284.504691 | 4.358636 | 2.427367 | 0.030294 | 0.048692 | 0.529770 | 0.098796 |
| min | 0.000000 | 2008.000000 | 1.000000 | 51.852995 | 4.380001 | -2.526708 | 0.000000 |
| 25% | 25360.750000 | 2014.000000 | 6.000000 | 51.891868 | 4.457293 | -0.669447 | 0.000000 |
| 50% | 50721.500000 | 2016.000000 | 6.000000 | 51.919943 | 4.489459 | -0.319260 | 0.000000 |
| 75% | 76082.250000 | 2019.000000 | 9.000000 | 51.943226 | 4.530447 | 0.078983 | 0.000000 |
| max | 101443.000000 | 2022.000000 | 12.000000 | 51.994166 | 4.601524 | 1.748261 | 1.000000 |
# Histogram of the attractiveness at image level
fig, ax = plt.subplots(figsize=(6, 3))
ax.hist(img_metadata['attractiveness'], bins=100)
ax.set_xlabel('Visual attractiveness')
ax.set_ylabel('Frequency')
ax.set_title('Histogram of the attractiveness at image level')
plt.show()
Here we can see:
- that the distribution of the attractiveness values is slightly skewed to the right: most images have a negative attractiveness value, but there is a longer tail towards positive values (which is why the mean, -0.29, lies above the median, -0.32).
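To back up the visual impression with a number, pandas can compute the skewness coefficient directly. The snippet below uses synthetic right-skewed data as a stand-in so it runs on its own; on the real data you would simply call img_metadata['attractiveness'].skew().

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the attractiveness column (a shifted lognormal has a right tail)
rng = np.random.default_rng(1)
vals = pd.Series(rng.lognormal(mean=0.0, sigma=0.5, size=10_000) - 1.4)

skew = vals.skew()                  # positive for a right-skewed distribution
print(vals.mean() > vals.median())  # right skew pulls the mean above the median
print(f"skewness: {skew:.2f}")
```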
Now let's briefly see how many images we have per geospatial hexagon. For this we will use the hex_id column. This column contains a unique identifier for each hexagon.
Exercise 1: Explore the statistics and geospatial distribution of the images¶
A First, verify how many images are available per hexagon. Visualize them in a table or a plot.
B Create histograms of images by year and month. Comment on what you see.
C Visualize on a map the distribution of images by year and month (to convert the DataFrame into a GeoDataFrame, use the method provided below). Create one map for each year (2008-2022) and for each month (Jan-Dec). Interpret your results.
D Do you think the month and year of the images could impact the (prediction of) perceived visual attractiveness?
def dataframe_to_geodataframe_nl(original_dataframe, latitude_column_name, longitude_column_name):
    '''
    Converts a DataFrame into a GeoDataFrame using the latitude and longitude columns.
    The output is ready to use with the plot function from GeoPandas.
    '''
    ## Create the Point geometry using the lat/lng columns
    original_dataframe['geometry'] = original_dataframe.apply(lambda x: Point(x[longitude_column_name], x[latitude_column_name]), axis=1)
    ## Create the GeoDataFrame (crs 4326 is the EPSG code for latitude/longitude coordinates)
    geodataframe = gpd.GeoDataFrame(original_dataframe, geometry='geometry', crs=4326)
    ## Reproject to EPSG:28992 (RD New), the projected CRS used in the Netherlands
    geodataframe = geodataframe.to_crs(28992)
    return geodataframe
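As a side note, the row-wise apply above becomes slow on 100k+ rows. GeoPandas offers a vectorised helper, points_from_xy, that builds the whole geometry column in one call. The sketch below shows the same conversion with that helper; the function name and demo values are our own, not part of the lab code.

```python
import geopandas as gpd
import pandas as pd

def dataframe_to_geodataframe_fast(df, lat_col, lng_col):
    # points_from_xy builds the Point geometry column in a single vectorised call
    gdf = gpd.GeoDataFrame(
        df, geometry=gpd.points_from_xy(df[lng_col], df[lat_col]), crs=4326
    )
    # Reproject to EPSG:28992 (RD New), the projected CRS used in the Netherlands
    return gdf.to_crs(28992)

demo = pd.DataFrame({'lat': [51.92, 51.94], 'lng': [4.49, 4.53]})
gdf = dataframe_to_geodataframe_fast(demo, 'lat', 'lng')
print(gdf.crs)
```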
# ADD HERE YOUR ANSWER TO EXERCISE 1
# A. Checking the number of images per hexagon
hex_num_imgs = img_metadata[['hex_id', 'img_id']].groupby('hex_id').count()
hex_num_imgs = hex_num_imgs.rename(columns={'img_id': 'num_imgs'})
fig, ax = plt.subplots(figsize=(6, 3))
ax.hist(hex_num_imgs['num_imgs'])
ax.set_xlabel('Number of images per hexagon')
ax.set_ylabel('Frequency')
ax.set_title('Histogram of the number of images per hexagon')
plt.show()
# B. Creating the histogram of images by year and month
fig, ax = plt.subplots(1, 2, figsize=(10, 3))
ax[0].hist(img_metadata['year'], bins=15)
ax[0].set_xlabel('Year')
ax[0].set_ylabel('Frequency')
ax[0].set_title('Histogram of the year of the images')
ax[1].hist(img_metadata['month'], bins=12)
ax[1].set_xlabel('Month')
ax[1].set_ylabel('Frequency')
ax[1].set_title('Histogram of the month of the images')
plt.show()
# The histogram by year shows a gap between 2011 and 2013. Also, 2014 is the year with the most images.
# The histogram by month shows that most images are from summer, especially July.
# C. To visualize the spatial distribution of the data, we can convert the dataframe into a geodataframe using the function provided in the lab session.
img_metadata_gdf = dataframe_to_geodataframe_nl(img_metadata, 'lat', 'lng')
## Now we can plot the data
# For years
fig, axes = plt.subplots(nrows=3, ncols=5, figsize=(18, 10))
for ax, year in zip(axes.flatten(), range(2008, 2023)):
    if (img_metadata_gdf.year == year).any():
        img_metadata_gdf[img_metadata_gdf.year == year].plot(ax=ax, markersize=0.5, alpha=0.5)
    ax.set_title('Year {}'.format(year))
    ax.set_axis_off()
plt.show()
# The same for each month
fig, axes = plt.subplots(nrows=3, ncols=4, figsize=(15, 10))
for ax, month in zip(axes.flatten(), range(1, 13)):
    # Note: checking the boolean mask itself; a list wrapped around the mask would always be truthy
    if (img_metadata_gdf.month == month).any():
        img_metadata_gdf[img_metadata_gdf.month == month].plot(ax=ax, markersize=0.5, alpha=0.5)
    ax.set_title('Month: {}'.format(month))
    ax.set_axis_off()
plt.show()
# The maps show that from 2014 onwards, images have been taken broadly across the city. Most images are from the summer months, which also show the broadest spatial coverage.
# D. Summer pictures may be perceived as more attractive than winter pictures because there is more green vegetation.
1.3. Exploring the geospatial data¶
The second dataset hexagons corresponds to a geospatial dataset of hexagons.
hexagons.head(5)
| hex_id | geometry | |
|---|---|---|
| 0 | 8b19694da002fff | POLYGON ((4.59455 51.99293, 4.5946 51.9927, 4.... |
| 1 | 8b19694da003fff | POLYGON ((4.59496 51.99263, 4.595 51.9924, 4.5... |
| 2 | 8b19694da006fff | POLYGON ((4.59482 51.99331, 4.59486 51.99309, ... |
| 3 | 8b19694da008fff | POLYGON ((4.59577 51.99204, 4.59582 51.99181, ... |
| 4 | 8b19694da00efff | POLYGON ((4.59537 51.99234, 4.59541 51.99211, ... |
The dataset contains only two columns. The first one, hex_id, is just a unique identifier for each hexagon. The second one, geometry, corresponds to the coordinates describing the shape and location of each hexagon. The following code allows us to visualize the hexagons on a map using the explore() method from GeoPandas.
fig = Figure(width=400, height=300)
fig.add_child(hexagons.explore(marker_kwds={'radius': 10}, zoom_start=15))
display(fig)